Efficient Algorithms for Mining Maximal Flexible Patterns in Texts and Sequences
Authors
Abstract
In this paper, we study the maximal pattern discovery problem in a given sequence for the class ERP of flexible patterns, with applications to text mining. A flexible pattern, also known as an erasing regular pattern, is a sequence of constants and wildcards that match possibly empty strings, such as AB*B*ABC. We first discuss the framework of optimal pattern discovery for predictive mining and text classification, and then show its connection to maximal pattern discovery. Next, we introduce a new notion of maximality of patterns based on the position occurrences of patterns, called position-maximality. We present an efficient algorithm PosMaxFlexMotif that, given an input string of length n over an alphabet Σ, enumerates all maximal patterns of ERP without duplicates in O(kmn^2) time per maximal pattern using O(mn) space, where m = |P| is the size of the pattern P to be enumerated and k = O(m) is the number of variables in P. As a corollary, the position-maximal enumeration problem for flexible patterns is solvable in output-polynomial time. Finally, we apply this result, as a sound pruning technique, to maximal pattern discovery under maximality based on document occurrence.
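To make the notion of a flexible pattern concrete, the following minimal Python sketch checks where a pattern such as AB*B*ABC occurs in a string, treating each * as a wildcard for a possibly empty string. It is not the PosMaxFlexMotif algorithm. The function names `occurs_at` and `start_positions` are hypothetical, and recording an occurrence by its start position is an assumption made for illustration; the abstract does not spell out the exact definition of a position occurrence.

```python
def occurs_at(pattern: str, text: str, start: int) -> bool:
    """Return True if `pattern` matches some substring of `text` beginning at `start`.

    `pattern` is a string of constants with '*' standing for a possibly
    empty string (illustrative convention; '*' is not a constant here).
    """
    segments = pattern.split("*")
    # The first constant segment must begin exactly at `start`.
    if not text.startswith(segments[0], start):
        return False
    pos = start + len(segments[0])
    # Each later segment is placed at its leftmost possible position;
    # matching greedily to the left leaves the most room for the rest.
    for seg in segments[1:]:
        idx = text.find(seg, pos)
        if idx < 0:
            return False
        pos = idx + len(seg)
    return True


def start_positions(pattern: str, text: str) -> list[int]:
    """List every start position in `text` at which `pattern` occurs."""
    return [p for p in range(len(text)) if occurs_at(pattern, text, p)]


if __name__ == "__main__":
    print(start_positions("AB*B*ABC", "ABXBYABCAB"))
    print(start_positions("A*B", "AABAB"))
```

Under this convention, a pattern occurs in a document whenever `start_positions` returns a non-empty list, which is one simple way to read the document-occurrence view mentioned at the end of the abstract.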
Similar resources
High Fuzzy Utility Based Frequent Patterns Mining Approach for Mobile Web Services Sequences
Nowadays, high fuzzy utility based pattern mining is an emerging topic in data mining. It refers to discovering all patterns whose utility meets a user-specified minimum high utility threshold. It comprises extracting patterns that are highly accessed in mobile web service sequences. Unlike the traditional fuzzy approach, high fuzzy utility mining considers not only counts of mob...
A SAT model to mine flexible sequences in transactional datasets
Traditional pattern mining algorithms generally suffer from a lack of flexibility. In this paper, we propose a SAT formulation of the problem to successfully mine frequent flexible sequences occurring in transactional datasets. Our SAT-based approach can easily be extended with extra constraints to address a broad range of pattern mining applications. To demonstrate this claim, we formulate and...
Query Driven Sequence Pattern Mining
The discovery of frequent patterns present in biological sequences has a large number of applications, including classification, clustering, and understanding sequence structure and function. This paper presents an algorithm that discovers frequent sequence patterns (motifs) present in a query sequence with respect to a database of sequences. The query is used to guide the mining process and th...
Efficient Algorithms for Discovering Frequent and Maximal Substructures from Large Semistructured Data
In this paper, we review recent advances in efficient algorithms for semi-structured data mining, that is, discovery of rules and patterns from structured data such as sets, sequences, trees, and graphs. After introducing basic definitions and problems, we present efficient algorithms for frequent and maximal pattern mining for classes of sets, sequences, and trees. In particular, we explain ge...
Common Zero Points of Two Finite Families of Maximal Monotone Operators via Proximal Point Algorithms
This work presents iterative schemes for finding common points of the solution set of a system of generalized mixed equilibrium problems, the solution set of the variational inequality for an inverse-strongly monotone operator, the common fixed point set of two infinite sequences of relatively nonexpansive mappings, and the common zero point set of two finite sequences of maximal monot...